The purpose of this project is to take a player’s individual statistics and attempt to determine whether they won or lost that game. I will then use this predictive capability to try to guess the winner of the largest tournament of the year, Call of Duty Champs.
Call of Duty is a first-person shooter franchise that began in 2003. Since then, it has become one of the largest multiplayer video game franchises in existence, and over this time a competitive scene for the game has gained traction. In 2016, the Call of Duty World League was born – a sponsored league that hosts major tournaments throughout the year for the best players in the world. In these events, the pros play three different game modes to decide the winner of a series: Hardpoint, Search and Destroy, and a third game mode that often changes yearly. For the data that we are covering, the third game mode is Control. Each team in the league consists of 5 players, and the series are best of 5.
embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s")
In Hardpoint, the two teams fight over a point on the map called the “Hardpoint”; for every second a team spends in this point, it gains one point. If both teams are in the Hardpoint at the same time, then neither team collects points. Every sixty seconds, the Hardpoint changes location on the map, so teams must make tactical decisions to rotate across the map. The first team to 250 points wins the map.
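As a toy illustration of the scoring rule above (a simplified sketch, assuming one scoring tick per second and ignoring every other game mechanic):

```r
# Toy sketch of Hardpoint scoring: one point per uncontested second,
# nothing awarded while both teams are in the hill
score_tick <- function(score, a_in_hill, b_in_hill) {
  if (a_in_hill && !b_in_hill) score["A"] <- score["A"] + 1
  if (b_in_hill && !a_in_hill) score["B"] <- score["B"] + 1
  score
}

score <- c(A = 0, B = 0)
score <- score_tick(score, TRUE, FALSE)   # A holds the hill: A scores
score <- score_tick(score, TRUE, TRUE)    # contested: neither team scores
score <- score_tick(score, FALSE, TRUE)   # B holds the hill: B scores
score  # A = 1, B = 1
```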
embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
use_start_time(6*60 + 35)
In Search and Destroy, the two teams play rounds where each player has only one life; if you die, you are out until the next round. A team wins a round by eliminating the entire enemy team before the time limit; the offensive team can also win by planting the bomb and letting it detonate, which happens 45 seconds after the plant. The first team to win 6 rounds wins the map.
embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
use_start_time(18*60 + 39)
In Control, there is an offensive team and a defensive team, and the teams switch sides across multiple rounds. Each team has 30 lives per round. The offensive team tries to either capture two points on the map or eliminate all 30 of the other team’s lives; the defensive team tries to either defend the two points until time runs out or eliminate all 30 of the other team’s lives. The first team to win three rounds wins the map.
embed_url("https://www.youtube.com/watch?v=VQC0aZuGBFs&t=2740s") %>%
use_start_time(45*60 + 40)
This model is useful because it will allow us to see whether a player’s statistics may have contributed to a win or not. As a fan of COD Competitive, I know there is a lot of debate about statistics and their importance, so I wanted to look directly at the impact of a player’s statistics.
All the packages are loaded below.
library(readr)
library(tidyverse)
library(tidymodels)
library(janitor)
library(sqldf)
library(sjmisc)
library(vembedr)
library(rpart)
library(rpart.plot)
library(parsnip)
library(vip)
library(randomForest)
library(dplyr)
library(tidyr)
library(data.table)
library(datasets)
library(caret)
library(ROCR)
library(tune)
library(kknn)
library(RColorBrewer)
This project makes use of official CWL data that is uploaded on GitHub. All data is organized relatively cleanly, and all missing data is reported.
proleague2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-07-05-proleague.csv"))
fortworth2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-03-17-fortworth.csv"))
london2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-05-05-london.csv"))
anaheim2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-06-16-anaheim.csv"))
proleagueFinals2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-07-21-proleague-finals.csv"))
# all stats for all major tournaments (EXCEPT CHAMPS) in BO4 (2019)
majors2019 <- rbind(proleague2019, fortworth2019, london2019, anaheim2019, proleagueFinals2019)
# champs will act as our test data; we will try and predict the winner
champs2019 <- read_csv(url("https://raw.githubusercontent.com/Activision/cwl-data/master/data/data-2019-08-18-champs.csv"))
In order to determine a win for a game, we will need to address Hardpoint, Search and Destroy, and Control separately. Each of these game modes has different parameters, so we will have to fit models for each game mode.
Hardpoint:
- end_time — helpful for organizing games chronologically
- match_id — helpful for getting rid of missing data
- win — ‘1’ for a win and ‘0’ for a loss
- team — player’s corresponding team
- player — what player the data corresponds to
- mode — game mode
- k_d — kill/death ratio; used to show overall impact on the map
- assists — in addition to k/d, assists show overall support on the map
- role — a role is determined for each player depending on their most common gun throughout the year
- damage_dealt — total damage done in the map
- player_spm — score per minute
- x2_piece — number of two-pieces (two kills in quick succession)
- x3_piece — number of three-pieces (three kills in quick succession)
- x4_piece — number of four-pieces (four kills in quick succession)
- hill_time_s — hill time measured in seconds
- hill_captures — shows activity on the map
- hill_defends — shows activity on the map

Search and Destroy:
- end_time — helpful for organizing games chronologically
- match_id — helpful for getting rid of missing data
- win — ‘1’ for a win and ‘0’ for a loss
- team — player’s corresponding team
- player — what player the data corresponds to
- mode — game mode
- k_d — kill/death ratio; used to show overall impact on the map
- assists — in addition to k/d, assists show overall support on the map
- role — a role is determined for each player depending on their most common gun throughout the year
- damage_dealt — total damage done in the map
- player_spm — score per minute
- x2_piece — number of two-pieces (two kills in quick succession)
- x3_piece — number of three-pieces (three kills in quick succession)
- x4_piece — number of four-pieces (four kills in quick succession)
- fb_round_ratio — ‘snd_firstbloods’/‘snd_rounds’
- bomb_sneak_defuses — sneak defuses are often in pivotal rounds
- bomb_plants — good indicator of role
- bomb_defuses — good indicator of role
- snd_rounds — used to calculate fb_round_ratio
- snd_firstbloods — used to calculate fb_round_ratio; also used in EDA
- snd_1_kill_round — counts the rounds where the given player got 1 kill
- snd_2_kill_round — counts the rounds where the given player got 2 kills
- snd_3_kill_round — counts the rounds where the given player got 3 kills
- snd_4_kill_round — counts the rounds where the given player got 4 kills
- fb_avg — the firstblood average for each player; ultimately was only used in the EDA

Control:
- end_time — helpful for organizing games chronologically
- match_id — helpful for getting rid of missing data
- win — ‘1’ for a win and ‘0’ for a loss
- team — player’s corresponding team
- player — what player the data corresponds to
- mode — game mode
- k_d — kill/death ratio; used to show overall impact on the map
- assists — in addition to k/d, assists show overall support on the map
- role — a role is determined for each player depending on their most common gun throughout the year
- damage_dealt — total damage done in the map
- player_spm — score per minute
- x2_piece — number of two-pieces (two kills in quick succession)
- x3_piece — number of three-pieces (three kills in quick succession)
- x4_piece — number of four-pieces (four kills in quick succession)
- ctrl_firstbloods — first kill in a round of control
- ctrl_firstdeaths — first death in a round of control
- ctrl_captures — number of captures in a control game

I will be splitting my data just before I work on my models. I found that this made the most sense, as I will have to split multiple times for multiple different game modes. This way, it will be easier to organize.
The data below is for all of the majors throughout the season, except for COD Champs. I will use the Champs data later as a way to further test my model. The raw data from each major is merged into one major dataset, further broken up into Hardpoint, SND, and Control datasets.
Below is the code for the merging of the data from all Majors during the season. In order to make sure that the data was ready for use, I went through a few cleaning processes:
# CLEANING
majors2019 <- majors2019 %>% clean_names(.)
# new dataset that contains all of the missing data, just in case
majors2019_missing <- sqldf('SELECT * FROM majors2019 WHERE match_id LIKE "missing%"')
# whole event data, all players and all maps, where player names are organized alphabetically
majors2019 <- majors2019[order(majors2019$player),]
# removes missing values
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE match_id NOT LIKE "missing%"')
# calculates all the players that have played more than 50 games
player_numgames <- count(majors2019, player) %>% subset(., n > 50) %>% remove_cols(n)
# includes all existing data for all players that have played more than 50 games (arbitrary number)
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE player IN player_numgames')
# removes all matches where damage = 0; almost always occurs as a result of data loss
majors2019 <- sqldf('SELECT * FROM majors2019 WHERE damage_dealt != "0"')
# changes W to 1, L to 0
majors2019$win <- ifelse(majors2019$win == "W", 1, 0) %>%
as.factor()
# assigning a role to each player to allow for more precise comparisons
playerRoles <- majors2019 %>%
group_by(player) %>%
count(player, fave_weapon) %>%
top_n(1, n) %>%
mutate(role = fave_weapon) %>%
subset(select = -c(fave_weapon, n))
# A player's role is defined as a sub (1), flex (2), or an ar (3).
# replace fav gun with corresponding role
playerRoles$role <- str_replace(playerRoles$role, "Saug 9mm", "1")
playerRoles$role <- str_replace(playerRoles$role, "Maddox RFB", "2")
playerRoles$role <- str_replace(playerRoles$role, "ICR-7", "3")
# making factors
playerRoles$role <- factor(playerRoles$role)
# manual adjustment for player TJHaly
playerRoles <- playerRoles[-c(83), ]
# joining two sets together
majors2019 <- dplyr::inner_join(playerRoles, majors2019, by = "player")
Now that this Majors dataset was created, I was able to easily subset from it for each different game mode.
Below is a subset of all the Hardpoint data for the season. I used the “sqldf” package as a way to refresh on my SQL skills, as well as remove any data that I deemed unnecessary. This will make it easier to manage and organize in the future.
# all 2019 hardpoint data
hp2019 <- sqldf('SELECT player, k_d, role, win, assists, damage_dealt, player_spm, hill_time_s, hill_captures, hill_defends, x2_piece, x3_piece, x4_piece FROM majors2019 WHERE mode == "Hardpoint"')
hp2019 <- hp2019[order(hp2019$player),]
Below is a subset of all the Search and Destroy data for the season. This will make it easier to manage and organize in the future. I also add a few different columns: fb_round_ratio and fb_avg.
Note: Ultimately, I ended up not using fb_avg in my models, but I left it as an interesting player-driven statistic.
# all 2019 SND data
snd2019 <- sqldf('SELECT match_id, team, player, role, win, kills, deaths, k_d, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_rounds, snd_firstbloods, snd_1_kill_round, snd_2_kill_round, snd_3_kill_round, snd_4_kill_round, x2_piece, x3_piece, x4_piece FROM majors2019 WHERE mode == "Search & Destroy"')
# adds new column with fb/round ratio
snd2019 <- add_column(snd2019, fb_round_ratio = snd2019$snd_firstbloods/snd2019$snd_rounds)
# adding a new column with average first bloods for the season
snd2019 <- snd2019 %>%
group_by(player) %>%
mutate(fb_avg = mean(snd_firstbloods))
# puts data in alphabetical order
snd2019 <- snd2019[order(snd2019$player),]
Below is a subset of all the Control data for the season. This will make it easier to manage and organize in the future.
# all 2019 CONTROL data
control2019 <- sqldf('SELECT player, role, win, k_d, assists, damage_dealt, player_spm, x2_piece, x3_piece, x4_piece, ctrl_firstbloods, ctrl_firstdeaths, ctrl_captures FROM majors2019 WHERE mode == "Control"')
control2019 <- control2019[order(control2019$player),]
Similar to the Majors dataset, I also needed to tidy up the Champs data. I followed these steps:
champs2019 <- champs2019 %>% clean_names(.)
champs2019 <- champs2019[order(champs2019$player),]
champs2019 <- sqldf('SELECT * FROM champs2019 WHERE match_id NOT LIKE "missing%"')
champs2019 <- sqldf('SELECT * FROM champs2019 WHERE damage_dealt != "0"')
# changes W to 1, L to 0
champs2019$win <- ifelse(champs2019$win == "W", 1, 0) %>%
as.factor()
champs2019 <- dplyr::inner_join(playerRoles, champs2019, by = "player")
Below is a subset of the Hardpoint data used for my Champs analysis. I selected directly for the grand finals match, as that would be the match that I would analyze.
# CHAMPS 2019 hardpoint data
hpChamps <- sqldf('SELECT team, end_time, match_id, player, k_d, role, win, kills, deaths, x, assists, damage_dealt, player_spm, hill_time_s, hill_captures, hill_defends, x2_piece, x3_piece, x4_piece FROM champs2019 WHERE series_id = "champs-bracket-grand-finals-0" AND mode = "Hardpoint"')
hpChamps <- hpChamps[order(hpChamps$team),]
Below is a subset of the Search and Destroy data used for my Champs analysis. I selected directly for the grand finals match, as that would be the match that I would analyze.
# CHAMPS 2019 SND data
sndChamps <- sqldf('SELECT team, end_time, match_id, player, win, role, k_d, assists, damage_dealt, player_spm, bomb_sneak_defuses, bomb_plants, bomb_defuses, snd_rounds, snd_firstbloods, x2_piece, x3_piece, x4_piece, snd_1_kill_round, snd_2_kill_round, snd_3_kill_round, snd_4_kill_round FROM champs2019 WHERE mode == "Search & Destroy"')
# adds new column with fb/round ratio
sndChamps <- add_column(sndChamps, fb_round_ratio = sndChamps$snd_firstbloods/sndChamps$snd_rounds)
# adding a new column with average first bloods for the season
sndChamps <- sndChamps %>%
group_by(player) %>%
mutate(fb_avg = mean(snd_firstbloods))
# puts data in alphabetical order
sndChamps <- sndChamps[order(sndChamps$team),]
Below is a subset of the Control data used for my Champs analysis. I selected directly for the grand finals match, as that would be the match that I would analyze.
# CHAMPS 2019 CONTROL data
controlChamps <- sqldf('SELECT team, end_time, ctrl_firstbloods, ctrl_firstdeaths, ctrl_captures, x2_piece, x3_piece, x4_piece, player, role, win, k_d, assists, damage_dealt, player_spm FROM champs2019 WHERE mode == "Control"')
controlChamps <- controlChamps[order(controlChamps$team),]
For my exploratory data analysis, I will be using just the season data. It will not include the Champs data. It is done on non-split data, since I was personally interested in seeing the top performers for the WHOLE season. Splitting the data, while it would help remove any type of bias on my end, would skew the EDA and would ultimately confuse me more in the future.
From what I know about Call of Duty, K/D is always the most important, or at least the most talked about, statistic for a player. It is used as an indicator that a player is doing really well for his team. However, it must also be taken with a grain of salt, as some players are known to pad their stats by “baiting” their teammates. The following graphs will look directly at the varying K/D’s of players over the whole season.
ggplot(majors2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", title = "OVERALL Player K/D's, 2019 Season (BO4), Descending")
As we can see from the above graph, the players with the top overall K/D’s for the season were:
This is interesting since this season was Simp’s breakout year. Since then, Simp has grown to be regarded as one of the best Call of Duty players of all time, having won 3 rings in 4 years, something that even the greats haven’t done.
We can also see that the players with the lowest overall K/D’s for the season were:
ggplot(hp2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", title = "Player K/D's for HARDPOINT, 2019 Season (BO4), Descending")
From this plot, we can see that the top K/D’s in Hardpoint were:
The lowest K/D’s in Hardpoint were:
Four of these bottom performers were also a part of the lowest overall K/D’s. Hardpoint may be the reason that they all struggled.
ggplot(snd2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 5)) + labs(y = "Kill/death ratio", x = "Player", title = "Player K/D's for SEARCH AND DESTROY, 2019 Season (BO4), Descending")
From this plot, we can see that the top K/D’s for Search and Destroy were:
The lowest K/D’s in Search and Destroy were:
Once again, we can see that Knight, Goonjar, and Diabolic have not been performing well.
ggplot(control2019, aes(x = reorder(player, k_d), y = k_d)) + geom_boxplot() + coord_flip(ylim = c(0, 3.5)) + labs(y = "Kill/death ratio", x = "Player", title = "Player K/D's for CONTROL, 2019 Season (BO4), Descending")
From this plot, we can see that the top K/D’s for Control were:
Throughout all of these top K/D’s, we have repeatedly seen Simp and Dashy at the top, which reflects in their overall K/D’s.
The lowest K/D’s in Control were:
I only recently started watching Call of Duty again in the past few years, which means I didn’t watch this season that I am studying. I find it interesting that I have never heard of any of these bottom players except Felony. They couldn’t put up good numbers, and thus they haven’t made it back into the league.
Search and Destroy is a game mode that has multiple rounds, where in each round, every player has only one life. A “first blood” is the first kill of the round and is usually highly influential. This is a common stat that commentators and the community look at.
The firstblood average metric is simply the average number of firstbloods that a player gets per game.
# player firstblood average for SND 2019
ggplot(snd2019, aes(x = reorder(player, fb_avg), y = fb_avg)) + geom_point() + coord_flip(ylim = c(0, 3)) + labs(y = "Firstblood Average", x = "Player", title = "Player Firstblood Average for SEARCH AND DESTROY, 2019 Season (BO4), Descending")
As we can see from the plot, Huke led the pack with about 1.7 firstbloods per game. He was followed closely by Envoy, Dylan, Lucky, and Attach with about 1.4 firstbloods per game each. This is interesting since most of these players are SMG (submachine gun) players. Their role is to find routes and open up the round by (hopefully) getting a first blood. The community would likely look at each of these players with high praise; considering that I have heard of everyone except Lucky, these players are still great performers. Next, I will look at total firstbloods per player.
# player firstbloods for SND 2019
ggplot(snd2019, aes(x = reorder(player, snd_firstbloods), y = snd_firstbloods)) + geom_boxplot() + coord_flip(ylim = c(0, 6)) + labs(y = "Firstbloods", x = "Player", title = "Player Firstbloods for SEARCH AND DESTROY, 2019 Season (BO4), Descending")
The reason I plotted this was mainly to look at the outliers in the dataset. I was interested in seeing which players had the most firstbloods in a single game. By looking at this plot, we can see that Dylan, Lucky, and Slasher all had 6 firstbloods in a single game. I would predict that they all won their respective games. Next, I will look at each player’s firstblood/round ratio.
I created this metric as a more realistic way of comparing players’ firstbloods. Obviously, if a game goes on for 11 rounds, a player is way more likely to have more firstbloods than if a game only went for 6 rounds. Because of this, I wanted to see the ratio for the number of firstbloods per round.
# player firstblood/round for SND 2019
ggplot(snd2019, aes(x = reorder(player, fb_round_ratio), y = fb_round_ratio)) + geom_boxplot() + coord_flip(ylim = c(0, 0.6)) + labs(y = "Firstblood/round ratio", x = "Player", title = "Player Firstblood/Round for SEARCH AND DESTROY, 2019 Season (BO4), Descending")
I didn’t know what qualified as a “good” number for this metric, so it was important that I put this in descending order.
The players with the top Firstblood/round ratio were:
Huke, Envoy, Dylan, and Lucky seem to be common factors when it comes to getting firstbloods in Search and Destroy. Next, I want to look at how each player compares with their overall damage dealt.
# player damage dealt OVERALL 2019
# created a new dataset for plotting damage_dealt
playerDamage <- sqldf('SELECT player, damage_dealt FROM majors2019 WHERE damage_dealt != "0"')
ggplot(playerDamage, aes(x = reorder(player, damage_dealt), y = damage_dealt)) + geom_boxplot() + coord_flip(ylim = c(0, 10000)) + labs(y = "Damage Dealt", x = "Player", title = "OVERALL Player Damage Dealt, 2019 Season (BO4), Descending")
From the above plot, we can see that the top performers damage wise were:
The lowest damage was put out by:
Once again, there are common factors between the highest performers and the lowest performers. It is also worth noting that every player in the top 5 is still competing to this day, whereas I have never heard of any of the bottom 5. Next, I will look at each player’s overall score per minute.
This is a metric that has always been a part of Call of Duty, but as an average player, I never really look at it. I want to see how this statistic applies to professionals.
# Overall score per minute for 2019 season
ggplot(majors2019, aes(x = reorder(player, player_spm), y = player_spm)) + geom_boxplot() + coord_flip(ylim = c(0, 675)) + labs(y = "Score per minute", x = "Player", title = "OVERALL Player Score per minute, 2019 Season (BO4), Descending")
The players with the highest SPM were:
The players with the lowest SPM were:
What surprised me about this plot was to see Bance and JKap in the bottom 5. Bance is still considered a top player to this day and JKap has been a Call of Duty figurehead for years. On the other hand, the common names at the top were, once again, at the top. Next, I will look at each players’ number of wins throughout the year.
# Overall number of wins for 2019 season
playerwins <- sqldf('SELECT player, win FROM majors2019 WHERE win == "1"') # selects all the wins for each player
playerwins <- playerwins %>% count(player) # counts the number of wins per player
ggplot(playerwins, aes(x = reorder(player, n), y = n)) + geom_bar(stat = 'identity', fill = "lightblue", color = "white") + coord_flip() + labs(y = "Number of Wins", x = "Player", title = "OVERALL Number of Wins per Player, 2019 Season (BO4), Descending")
The top 4 players with the most wins in the season are Slasher, Octane, Kenny, and Enable. The interesting part about this is that all of these players were on the same team, 100 Thieves. They all tied with 116 wins during the season. They are followed closely by Priestahh, who was their fifth teammate. He likely had to sit out a few matches, or he got picked up shortly after the start of the season. Next, I will look at the distribution of wins.
playerwins %>%
ggplot(aes(x = n)) + geom_histogram(binwidth = 15, color = "black", fill = "lightblue")
The number of wins appears to follow a normal distribution, with the median being about 60 wins. The top players had around 100-116 wins.
In order to plot the number of wins versus each stat, I will be making individual datasets as I go along. I feel that this section will be useful in determining possible strong indicators for a win or a loss. First, I will begin by plotting the number of wins vs. overall K/D.
playerkds <- sqldf('SELECT player, k_d FROM majors2019')
playerkds <- aggregate(playerkds$k_d, list(playerkds$player), FUN=mean)
kd_wins <- cbind(playerkds, playerwins)
ggplot(kd_wins, aes(x = x, y = n)) +
geom_point() +
geom_smooth(method = lm) +
labs(title = "Wins vs. Overall K/D", x = "Overall K/D", y = "Number of Wins")
The above plot shows that there is a positive correlation between a player’s overall K/D and the number of wins that they have. This might indicate that K/D is a good predictor for a win.
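The strength of this trend could also be quantified numerically; a minimal sketch on made-up numbers (hypothetical toy data, not the real kd_wins dataset):

```r
# Hypothetical toy data standing in for the per-player averages
kd   <- c(0.85, 0.95, 1.00, 1.10, 1.25)  # made-up overall K/D's
wins <- c(45, 60, 58, 80, 105)           # made-up win totals

cor(kd, wins)        # Pearson correlation; strongly positive here
coef(lm(wins ~ kd))  # slope of the fitted line, as drawn by geom_smooth(method = lm)
```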
playerfb <- sqldf('SELECT player, snd_firstbloods FROM majors2019')
playerfb <- aggregate(playerfb$snd_firstbloods, list(playerfb$player), FUN=mean)
fb_wins <- cbind(playerfb, playerwins)
ggplot(fb_wins, aes(x = x, y = n)) +
geom_point() +
geom_smooth(method = lm) +
labs(title = "Wins vs. Firstblood Average", x = "Average Firstbloods per Game", y = "Number of Wins")
Again, there appears to be a positive correlation between the two variables, but it is definitely weaker than with K/D. Nevertheless, firstbloods should make for a decent indicator of a win or loss.
playerdamage <- sqldf('SELECT player, damage_dealt FROM majors2019')
playerdamage <- aggregate(playerdamage$damage_dealt, list(playerdamage$player), FUN=mean)
damage_wins <- cbind(playerdamage, playerwins)
ggplot(damage_wins, aes(x = x, y = n)) +
geom_point() +
geom_smooth(method = lm) +
labs(title = "Wins vs. Damage Dealt", x = "Average Damage per Game", y = "Number of Wins")
There also appears to be a positive correlation between the amount of damage dealt per game and the number of wins a player has. Damage will likely make for a good predictor of a win or a loss.
playerspm <- sqldf('SELECT player, player_spm FROM majors2019')
playerspm <- aggregate(playerspm$player_spm, list(playerspm$player), FUN=mean)
spm_wins <- cbind(playerspm, playerwins)
ggplot(spm_wins, aes(x = x, y = n)) +
geom_point() +
geom_smooth(method = lm) +
labs(title = "Wins vs. Score per Minute", x = "Average SPM per Game", y = "Number of Wins")
Score per minute also has a positive correlation with the number of wins. Like all the others, this appears to be a decent predictor moving forward.
While there are definitely more metrics to investigate, I believe that my data analysis has provided me with a solid understanding of the dataset and an idea of what to expect in the later parts of my research. Next, I will move on to the exciting part: model building.
Because there are multiple different game modes that I am trying to predict, I must fit models to each one. This is the reason that there are so many models below. They are essentially the same thing, but their recipes are different to accommodate new predictor variables and remove unnecessary ones. I will begin with Hardpoint, then move to Search and Destroy, and then wrap up with Control. To reiterate, my goal is to try and predict whether an individual player will win or lose a game based on their statistics in the given game.
Here, I am splitting the Hardpoint data with 80% training and 20% testing. The data is stratified on the “win” variable.
set.seed(3068)
hp2019_split <- hp2019 %>%
initial_split(prop = 0.8, strata = "win")
hp2019_train <- training(hp2019_split)
hp2019_test <- testing(hp2019_split)
The few following lines of code are just to ensure that my data split correctly and that everything is looking good to proceed.
Head of dataset:
head(hp2019_train)
## player k_d role win assists damage_dealt player_spm hill_time_s
## 4 Abezy 0.66 1 0 6 3891 290.9 48
## 5 Abezy 1.18 1 0 8 4480 393.3 55
## 12 Abezy 0.88 1 0 14 4515 322.3 80
## 15 Abezy 0.76 1 0 8 4868 295.7 77
## 16 Abezy 0.83 1 0 4 3954 269.5 35
## 20 Abezy 1.33 1 0 7 4733 400.3 71
## hill_captures hill_defends x2_piece x3_piece x4_piece
## 4 4 5 2 0 0
## 5 5 14 2 1 0
## 12 9 6 4 0 0
## 15 7 11 5 0 0
## 16 3 6 2 2 0
## 20 6 10 3 0 0
Training dataset dimensions:
dim(hp2019_train)
## [1] 3551 13
Testing dataset dimensions:
dim(hp2019_test)
## [1] 889 13
Distribution for the number of wins in the training set:
prop.table(table(hp2019_train$win))
##
## 0 1
## 0.4987328 0.5012672
To begin my Hardpoint models, I made a recipe that contained all of my predictor variables, and then normalized all of them.
hp_recipe <- recipe(win ~ k_d + assists + damage_dealt +
player_spm + hill_time_s + hill_captures +
hill_defends + x2_piece + x3_piece + x4_piece,
data = hp2019_train) %>%
step_normalize(all_predictors())
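In base-R terms, step_normalize() centers each predictor to mean 0 and scales it to standard deviation 1; a quick sketch with hypothetical player_spm values:

```r
x <- c(250, 300, 320, 400, 290)  # hypothetical player_spm values
z <- (x - mean(x)) / sd(x)       # what step_normalize() computes per column
round(mean(z), 10)  # 0
round(sd(z), 10)    # 1
```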
After making my recipe, I decided to fold my data with 10 folds and 5 repeats.
hp_train_folds <- vfold_cv(hp2019_train, v = 10, repeats = 5)
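With 10 folds and 5 repeats, each model is assessed on 10 x 5 = 50 resamples. A base-R sketch of the fold assignment (toy row count, not the real training set):

```r
set.seed(3068)
n <- 100  # toy number of training rows
# each repeat independently shuffles the rows into 10 roughly equal folds
folds <- replicate(5, sample(rep(1:10, length.out = n)))
dim(folds)         # 100 rows x 5 repeats
table(folds[, 1])  # 10 rows per fold within the first repeat
```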
In order to create the decision tree, I used the following steps:
Creating a general decision tree specification using rpart:
hp_tree_spec <- decision_tree() %>%
set_engine("rpart")
Setting a classification decision tree engine:
hp_class_tree_spec <- hp_tree_spec %>%
set_mode("classification")
Fitting the model:
hp_class_tree_fit <- hp_class_tree_spec %>%
fit(win ~ k_d + assists + damage_dealt +
player_spm + hill_time_s + hill_captures +
hill_defends + x2_piece + x3_piece + x4_piece,
data = hp2019_train)
Visualizing the decision tree:
hp_class_tree_fit %>%
extract_fit_engine() %>%
rpart.plot(roundint=FALSE)
Checking confusion matrix and accuracy of the train data:
augment(hp_class_tree_fit, new_data = hp2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 1379 601
## 1 392 1179
hp_dt_accuracy <- augment(hp_class_tree_fit, new_data = hp2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
hp_dt_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.720
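As a sanity check, accuracy is just the diagonal of the confusion matrix above divided by the total number of predictions:

```r
# Confusion matrix from above: predictions in rows, truth in columns
cm <- matrix(c(1379, 392, 601, 1179), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Truth = c("0", "1")))
sum(diag(cm)) / sum(cm)  # ~0.720, matching the accuracy estimate
```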
Creating a workflow that is ready to tune cost complexity:
hp_class_tree_wf <- workflow() %>%
add_model(hp_class_tree_spec %>% set_args(cost_complexity = tune())) %>%
add_recipe(hp_recipe)
Setting up a regular grid:
param_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)
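Note that cost_complexity() works on the log10 scale, so the grid above spans ten values spaced evenly between 10^-3 and 10^-1:

```r
# the ten candidate cost-complexity values implied by the grid above
cc_grid <- 10^seq(-3, -1, length.out = 10)
signif(cc_grid, 3)
```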
Fitting and tuning our model:
hp_dt_tune <- hp_class_tree_wf %>%
tune_grid(
resamples = hp_train_folds,
grid = param_grid)
Plotting our model, which shows what cost-complexity produces the highest accuracy:
autoplot(hp_dt_tune)
Selecting the best performing value and finalizing the workflow:
hp_best_complexity <- select_best(hp_dt_tune)
hp_class_tree_final <- finalize_workflow(hp_class_tree_wf, hp_best_complexity)
hp_class_tree_final_fit <- fit(hp_class_tree_final, data = hp2019_train)
Visualizing the final model:
hp_class_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot(roundint = FALSE)
Checking the accuracy of the final model:
hp_tuned_dt_accuracy <- augment(hp_class_tree_final_fit, new_data = hp2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
hp_tuned_dt_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.756
As we can see, the final accuracy for the tuned model was slightly higher with an estimate of 0.756, compared to that of the untuned model with an estimate of 0.720.
Now it was time to prepare my model. I tuned min_n and mtry, set my mode to “classification”, and set my engine to “ranger.” My workflow was set up to use both my Hardpoint recipe and my Hardpoint random forest model.
hp_rf_model <- rand_forest(min_n = tune(),
mtry = tune(),
mode = "classification") %>%
set_engine("ranger")
hp_rf_workflow <- workflow() %>%
add_model(hp_rf_model) %>%
add_recipe(hp_recipe)
Next, I set up parameters for the grid that I was going to make. The parameters were taken from the Hardpoint random forest model, and the mtry range was set from 2 to 10, the total number of predictors in the recipe.
hp_rf_parameters <- hardhat::extract_parameter_set_dials(hp_rf_model) %>%
update(mtry = mtry(range = c(2, 10)))
hp_rf_grid <- grid_regular(hp_rf_parameters, levels = 2)
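With two tuned parameters and levels = 2, the regular grid is just the cross of each parameter’s two endpoint values, four candidate models in total (min_n’s default range in dials is 2 to 40, matching the values that appear in the results later):

```r
# the four parameter combinations produced by grid_regular() above
rf_grid_sketch <- expand.grid(mtry = c(2, 10), min_n = c(2, 40))
rf_grid_sketch
nrow(rf_grid_sketch)  # 4 candidate models
```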
Then, I ran my model by tuning and fitting, using my folded data and my grid.
hp_rf_tune <- hp_rf_workflow %>%
tune_grid(resamples = hp_train_folds,
grid = hp_rf_grid)
The last thing to do was to plot my tuned model.
autoplot(hp_rf_tune)
As we can see from the above plot, accuracy tends to decrease as we allow more predictor variables per split (larger mtry). I hypothesize that this is because k_d is by far the most significant predictor; the remaining predictors carry much less signal and mainly give the model room to overfit. The decrease in accuracy is very small, however.
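One way to check this hypothesis (a sketch, not something run in the original analysis) is to refit the forest with ranger's impurity-based variable importance and inspect the scores; hp_recipe and hp2019_train are the objects defined earlier.

```r
# Sketch: refit with impurity importance to see how dominant k_d really is.
# Assumes hp_recipe, hp2019_train, and tidymodels are loaded as above.
hp_rf_vip <- workflow() %>%
  add_model(rand_forest(mtry = 2, min_n = 2, mode = "classification") %>%
              set_engine("ranger", importance = "impurity")) %>%
  add_recipe(hp_recipe) %>%
  fit(hp2019_train)

# Named vector of importance scores, largest first.
hp_rf_vip %>%
  extract_fit_engine() %>%
  ranger::importance() %>%
  sort(decreasing = TRUE)
```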
Checking the accuracy of the final model:
hp_rf_tuned_accuracy <- show_best(hp_rf_tune, metric = "accuracy")
hp_rf_tuned_accuracy
## # A tibble: 4 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 2 accuracy binary 0.744 50 0.00304 Preprocessor1_Model1
## 2 2 40 accuracy binary 0.744 50 0.00280 Preprocessor1_Model3
## 3 10 40 accuracy binary 0.739 50 0.00300 Preprocessor1_Model4
## 4 10 2 accuracy binary 0.734 50 0.00298 Preprocessor1_Model2
hp_rf_tuned_accuracy[1,5]
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 0.744
We had the highest accuracy of 0.744 with a minimum node size of 2 and an mtry of 2.
First, I needed to set up my model. I set my engine to “glm” for logistic regression and set the mode to “classification.”
hp_log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
Setting up the workflow with the model I created in the last step, as well as the recipe that I created earlier.
hp_log_wkflow <- workflow() %>%
add_model(hp_log_reg) %>%
add_recipe(hp_recipe)
Fit the model to the folded data:
hp_log_fit <- fit_resamples(hp_log_wkflow, hp_train_folds)
Collecting metrics based on the folded data:
collect_metrics(hp_log_fit)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.748 50 0.00339 Preprocessor1_Model1
## 2 roc_auc binary 0.834 50 0.00296 Preprocessor1_Model1
Fitting the model to the whole dataset, not just the folds:
hp_log_fit_train <- fit(hp_log_wkflow, hp2019_train)
Assessing model performance with the training data:
predict(hp_log_fit_train, new_data = hp2019_train, type = "prob")
## # A tibble: 3,551 × 2
## .pred_0 .pred_1
## <dbl> <dbl>
## 1 0.845 0.155
## 2 0.745 0.255
## 3 0.242 0.758
## 4 0.860 0.140
## 5 0.914 0.0865
## 6 0.342 0.658
## 7 0.816 0.184
## 8 0.668 0.332
## 9 0.707 0.293
## 10 0.638 0.362
## # … with 3,541 more rows
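These probabilities come from the logistic (inverse-logit) transform of the model's linear predictor; base R's plogis() is exactly that transform:

```r
# The fitted glm produces a linear predictor eta for each observation;
# the predicted win probability is the inverse logit 1 / (1 + exp(-eta)).
plogis(0)         # 0.5  (a linear predictor of 0 is a coin flip)
plogis(c(-2, 2))  # ~0.119 and ~0.881
```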
Confusion matrix:
augment(hp_log_fit_train, new_data = hp2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 1380 499
## 1 391 1281
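As a sanity check, the accuracy reported in the next step can be recomputed by hand from this confusion matrix: correct predictions are the diagonal.

```r
# Accuracy from the confusion matrix above:
# (correctly predicted losses + correctly predicted wins) / all observations
conf <- matrix(c(1380, 391, 499, 1281), nrow = 2,
               dimnames = list(Prediction = c("0", "1"), Truth = c("0", "1")))
sum(diag(conf)) / sum(conf)
# [1] 0.7493664
```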
Another visualization of the confusion matrix:
augment(hp_log_fit_train, new_data = hp2019_train) %>%
conf_mat(truth = win, estimate = .pred_class) %>%
autoplot(type = "heatmap")
Checking accuracy with the training data:
hp_log_reg_accuracy <- augment(hp_log_fit_train, new_data = hp2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
hp_log_reg_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.749
The accuracy with the training data was about 0.749, which is roughly the same as the accuracy with the folded data.
Just like all of the previous models, I will follow a similar pathway. The steps are as follows:
Setting up the model. I will be tuning “neighbors.”
hp_knn_model <-
nearest_neighbor(
neighbors = tune(),
mode = "classification") %>%
set_engine("kknn")
Next, I set up the workflow.
hp_knn_workflow <- workflow() %>%
add_model(hp_knn_model) %>%
add_recipe(hp_recipe)
I then set up the tuning grid.
hp_knn_parameters <- hardhat::extract_parameter_set_dials(hp_knn_model)
hp_knn_grid <- grid_regular(hp_knn_parameters, levels = 2)
Fitting and tuning my model:
hp_knn_tune <- hp_knn_workflow %>%
tune_grid(resamples = hp_train_folds,
grid = hp_knn_grid)
Plotting the model:
autoplot(hp_knn_tune, metric = "accuracy")
This plot shows that accuracy was higher with 15 neighbors than with 1, though with only two grid values tried, the trend is coarse.
Testing the accuracy of the model:
hp_knn_tuned_accuracy <- show_best(hp_knn_tune, metric = "accuracy")
hp_knn_tuned_accuracy
## # A tibble: 2 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 accuracy binary 0.709 50 0.00334 Preprocessor1_Model2
## 2 1 accuracy binary 0.649 50 0.00329 Preprocessor1_Model1
As we can see, the model performs at its best with 15 neighbors, resulting in an accuracy of 0.709.
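Since only two values of neighbors (1 and 15) were tried, a denser grid would show whether accuracy keeps improving past 15. A sketch of that, not run in the original analysis:

```r
# Hypothetical finer grid: 10 values of neighbors between 1 and 30.
hp_knn_grid_fine <- grid_regular(neighbors(range = c(1, 30)), levels = 10)

# Re-tuning would then reuse the same workflow and folds:
# hp_knn_workflow %>% tune_grid(resamples = hp_train_folds, grid = hp_knn_grid_fine)
```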
hp_accuracies <- c(hp_tuned_dt_accuracy$.estimate,
hp_rf_tuned_accuracy[1,5],
hp_log_reg_accuracy$.estimate,
hp_knn_tuned_accuracy[1, 4])
hp_accuracies
## [[1]]
## [1] 0.7555618
##
## $mean
## [1] 0.7438462
##
## [[3]]
## [1] 0.7493664
##
## $mean
## [1] 0.7091532
The first number is the decision tree, followed by the random forest, logistic regression, and k-nearest neighbors models.
Our most accurate model was the decision tree with an accuracy of 0.756, followed by logistic regression’s 0.749, random forest’s 0.744, and k-nearest neighbors’ 0.709.
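Aside: the mixed list above comes from combining $.estimate numerics with one-cell tibble slices. Pulling plain numbers everywhere (using the same objects defined earlier) gives a cleaner named vector:

```r
# Same values as above, as a named numeric vector instead of a mixed list.
hp_accuracies_vec <- c(
  decision_tree = hp_tuned_dt_accuracy$.estimate,
  random_forest = hp_rf_tuned_accuracy$mean[1],
  logistic_reg  = hp_log_reg_accuracy$.estimate,
  knn           = hp_knn_tuned_accuracy$mean[1]
)
```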
Here, I am splitting the Search and Destroy data with 80% training and 20% testing. The data is stratified on the “win” variable.
set.seed(1)
snd2019_split <- snd2019 %>%
initial_split(prop = 0.8, strata = "win")
snd2019_train <- training(snd2019_split)
snd2019_test <- testing(snd2019_split)
The following few lines of code just ensure that my data split correctly and that everything looks good to proceed.
Head of the training set:
head(snd2019_train)
## # A tibble: 6 × 25
## # Groups: player [1]
## match_id team player role win kills deaths k_d assists damage_dealt
## <chr> <chr> <chr> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 737440468739… eUni… Abezy 1 0 11 7 1.57 0 1355
## 2 144010197940… eUni… Abezy 1 0 7 7 1 1 1938
## 3 259560665349… eUni… Abezy 1 0 2 7 0.29 4 796
## 4 146217688303… eUni… Abezy 1 0 7 7 1 0 1275
## 5 177954718572… eUni… Abezy 1 0 9 8 1.12 1 1561
## 6 167778343948… eUni… Abezy 1 0 10 9 1.11 3 1434
## # … with 15 more variables: player_spm <dbl>, bomb_sneak_defuses <dbl>,
## # bomb_plants <dbl>, bomb_defuses <dbl>, snd_rounds <dbl>,
## # snd_firstbloods <dbl>, snd_1_kill_round <dbl>, snd_2_kill_round <dbl>,
## # snd_3_kill_round <dbl>, snd_4_kill_round <dbl>, x2_piece <dbl>,
## # x3_piece <dbl>, x4_piece <dbl>, fb_round_ratio <dbl>, fb_avg <dbl>
Dimensions of the training set:
dim(snd2019_train)
## [1] 2791 25
Dimensions of the testing set:
dim(snd2019_test)
## [1] 699 25
Distribution of wins for the training set:
prop.table(table(snd2019_train$win))
##
## 0 1
## 0.498746 0.501254
To begin with my Search and Destroy models, I made a recipe that contained all of my predictor variables. I then normalized all of my predictor variables.
snd_recipe <- recipe(win ~ k_d + assists + damage_dealt +
player_spm + bomb_sneak_defuses +
bomb_plants + bomb_defuses + snd_firstbloods +
fb_round_ratio + snd_1_kill_round +
snd_2_kill_round + snd_3_kill_round +
snd_4_kill_round + x2_piece + x3_piece + x4_piece,
data = snd2019_train) %>%
step_normalize(all_predictors())
After making my recipe, I decided to fold my data with 10 folds and 5 repeats.
snd_train_folds <- vfold_cv(snd2019_train, v = 10, repeats = 5)
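This is also where the n = 50 in the later metric tables comes from: 10 folds repeated 5 times yields 50 resamples, and each reported mean accuracy averages 50 estimates.

```r
library(rsample)

# 10 folds x 5 repeats = 50 resamples; collect_metrics()/show_best() report
# the mean and standard error across these 50 fits.
nrow(vfold_cv(mtcars, v = 10, repeats = 5))
# [1] 50
```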
Creating a general decision tree specification using rpart:
snd_tree_spec <- decision_tree() %>%
set_engine("rpart")
Setting a classification decision tree engine:
snd_class_tree_spec <- snd_tree_spec %>%
set_mode("classification")
Fitting the model:
snd_class_tree_fit <- snd_class_tree_spec %>%
fit(win ~ k_d + assists + damage_dealt +
player_spm + bomb_sneak_defuses +
bomb_plants + bomb_defuses + snd_firstbloods +
fb_round_ratio + snd_1_kill_round +
snd_2_kill_round + snd_3_kill_round +
snd_4_kill_round + x2_piece + x3_piece + x4_piece,
data = snd2019_train)
Visualizing the decision tree:
snd_class_tree_fit %>%
extract_fit_engine() %>%
rpart.plot(roundint=FALSE)
Checking confusion matrix and accuracy of the train data:
augment(snd_class_tree_fit, new_data = snd2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 1099 544
## 1 293 855
snd_dt_accuracy <- augment(snd_class_tree_fit, new_data = snd2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
snd_dt_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.700
Creating a workflow that is ready to tune cost complexity:
snd_class_tree_wf <- workflow() %>%
add_model(snd_class_tree_spec %>% set_args(cost_complexity = tune())) %>%
add_recipe(snd_recipe)
Setting up a regular grid:
parameter_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)
Fitting and tuning our model:
snd_dt_tune <- snd_class_tree_wf %>%
tune_grid(resamples = snd_train_folds, grid = parameter_grid)
Plotting our model, which shows what cost-complexity produces the highest accuracy:
autoplot(snd_dt_tune)
Selecting the best performing value and finalizing the workflow:
snd_best_complexity <- select_best(snd_dt_tune, metric = "accuracy")
snd_class_tree_final <- finalize_workflow(snd_class_tree_wf, snd_best_complexity)
snd_class_tree_final_fit <- fit(snd_class_tree_final, data = snd2019_train)
Visualizing the final model:
snd_class_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot(roundint = FALSE)
Checking the accuracy of the final model:
snd_tuned_dt_accuracy <- augment(snd_class_tree_final_fit, new_data = snd2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
snd_tuned_dt_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.742
As we can see, the final accuracy for the tuned model was slightly higher with an estimate of 0.742, compared to that of the untuned model with an estimate of 0.700.
Now it was time to prepare my random forest model. I tuned min_n and mtry, set my mode to “classification”, and set my engine to “ranger.” My workflow was set up to use both my SND recipe and my SND random forest model.
snd_rf_model <- rand_forest(min_n = tune(),
mtry = tune(),
mode = "classification") %>%
set_engine("ranger")
snd_rf_workflow <- workflow() %>%
add_model(snd_rf_model) %>%
add_recipe(snd_recipe)
Next, I set up parameters for the grid that I was going to make. The parameters were extracted from the SND random forest model, and the mtry range was set from 1 to 10, slightly below the maximum number of predictors.
snd_rf_parameters <- hardhat::extract_parameter_set_dials(snd_rf_model) %>%
update(mtry = mtry(range = c(1, 10)))
snd_rf_grid <- grid_regular(snd_rf_parameters, levels = 2)
Then, I ran my model by tuning and fitting, using my folded data and my grid.
snd_rf_tune <- snd_rf_workflow %>%
tune_grid(resamples = snd_train_folds,
grid = snd_rf_grid)
The last thing to do was to plot my tuned model.
autoplot(snd_rf_tune)
As we can see from the above plot, accuracy and ROC AUC increase slightly as we allow more predictors per split. I hypothesize that this differs from the Hardpoint plot because more of the predictors are significant in Search and Destroy.
Checking the accuracy of the final model:
snd_rf_tuned_accuracy <- show_best(snd_rf_tune, metric = "accuracy")
snd_rf_tuned_accuracy
## # A tibble: 4 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 10 40 accuracy binary 0.709 50 0.00373 Preprocessor1_Model4
## 2 10 2 accuracy binary 0.703 50 0.00376 Preprocessor1_Model2
## 3 1 2 accuracy binary 0.668 50 0.00463 Preprocessor1_Model1
## 4 1 40 accuracy binary 0.666 50 0.00500 Preprocessor1_Model3
snd_rf_tuned_accuracy[1,5]
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 0.709
We had the highest accuracy of 0.709 with a minimum node size of 40 and an mtry of 10.
First, I needed to set up my model. I set my engine to “glm” for logistic regression and set the mode to “classification.”
snd_log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
Setting up the workflow with the model I created in the last step, as well as the recipe that I created earlier.
snd_log_wkflow <- workflow() %>%
add_model(snd_log_reg) %>%
add_recipe(snd_recipe)
Fit the model to the folded data:
snd_log_fit <- fit_resamples(snd_log_wkflow, snd_train_folds)
Collecting metrics based on the folded data:
collect_metrics(snd_log_fit)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.732 50 0.00328 Preprocessor1_Model1
## 2 roc_auc binary 0.807 50 0.00379 Preprocessor1_Model1
Fitting the model to the whole dataset, not just the folds:
snd_log_fit_train <- fit(snd_log_wkflow, snd2019_train)
Assessing model performance with the training data:
predict(snd_log_fit_train, new_data = snd2019_train, type = "prob")
## # A tibble: 2,791 × 2
## .pred_0 .pred_1
## <dbl> <dbl>
## 1 0.571 0.429
## 2 0.739 0.261
## 3 0.712 0.288
## 4 0.654 0.346
## 5 0.771 0.229
## 6 0.825 0.175
## 7 0.486 0.514
## 8 0.748 0.252
## 9 0.433 0.567
## 10 0.458 0.542
## # … with 2,781 more rows
Confusion matrix:
augment(snd_log_fit_train, new_data = snd2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 1134 474
## 1 258 925
Another visualization of the confusion matrix:
augment(snd_log_fit_train, new_data = snd2019_train) %>%
conf_mat(truth = win, estimate = .pred_class) %>%
autoplot(type = "heatmap")
Checking accuracy with the training data:
snd_log_reg_accuracy <- augment(snd_log_fit_train, new_data = snd2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
snd_log_reg_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.738
The final accuracy that we got was 0.738, which was slightly higher than the accuracy of 0.732 on the folded data.
Just like all of the previous models, I will follow a similar pathway. The steps are as follows:
Setting up the model. I will be tuning “neighbors.”
snd_knn_model <-
nearest_neighbor(
neighbors = tune(),
mode = "classification") %>%
set_engine("kknn")
Next, I set up the workflow.
snd_knn_workflow <- workflow() %>%
add_model(snd_knn_model) %>%
add_recipe(snd_recipe)
I then set up the tuning grid.
snd_knn_parameters <- hardhat::extract_parameter_set_dials(snd_knn_model)
snd_knn_grid <- grid_regular(snd_knn_parameters, levels = 2)
Fitting and tuning my model:
snd_knn_tune <- snd_knn_workflow %>%
tune_grid(resamples = snd_train_folds,
grid = snd_knn_grid)
Plotting the model:
autoplot(snd_knn_tune, metric = "accuracy")
This plot shows that accuracy was higher with 15 neighbors than with 1, though with only two grid values tried, the trend is coarse.
Testing the accuracy of the model:
snd_knn_tuned_accuracy <- show_best(snd_knn_tune, metric = "accuracy")
snd_knn_tuned_accuracy
## # A tibble: 2 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 accuracy binary 0.619 50 0.00403 Preprocessor1_Model2
## 2 1 accuracy binary 0.595 50 0.00385 Preprocessor1_Model1
As we can see, the model performs at its best with 15 neighbors, resulting in an accuracy of about 0.619.
snd_accuracies <- c(snd_tuned_dt_accuracy$.estimate,
snd_rf_tuned_accuracy[1,5],
snd_log_reg_accuracy$.estimate,
snd_knn_tuned_accuracy[1, 4])
snd_accuracies
## [[1]]
## [1] 0.7416697
##
## $mean
## [1] 0.7091295
##
## [[3]]
## [1] 0.7377284
##
## $mean
## [1] 0.6188415
The first number is the decision tree, followed by the random forest, logistic regression, and k-nearest neighbors models.
As we can see from this, the decision tree appears to have the highest accuracy with 0.742. This is followed by logistic regression with 0.738, random forest with 0.709, and k-nearest neighbors with 0.619.
Here, I am splitting the Control data with 80% training and 20% testing. The data is stratified on the “win” variable.
set.seed(1)
control2019_split <- control2019 %>%
initial_split(prop = 0.8, strata = "win")
control2019_train <- training(control2019_split)
control2019_test <- testing(control2019_split)
The following few lines of code just ensure that my data split correctly and that everything looks good to proceed.
Head of train set:
head(control2019_train)
## player role win k_d assists damage_dealt player_spm x2_piece x3_piece
## 4 Abezy 1 0 1.30 17 6459 386.2 5 1
## 16 Abezy 1 0 0.88 4 5353 215.9 2 0
## 18 Abezy 1 0 1.00 3 2657 285.7 3 0
## 21 Abezy 1 0 0.70 4 3077 225.0 1 0
## 22 Abezy 1 0 0.75 5 3357 217.5 2 1
## 23 Abezy 1 0 0.78 5 4315 215.3 1 1
## x4_piece ctrl_firstbloods ctrl_firstdeaths ctrl_captures
## 4 0 1 0 4
## 16 0 1 0 2
## 18 0 0 0 1
## 21 0 0 0 1
## 22 0 1 1 1
## 23 0 1 0 2
Dimensions of train set:
dim(control2019_train)
## [1] 2122 13
Dimensions of test set:
dim(control2019_test)
## [1] 532 13
Win distribution of the training data:
prop.table(table(control2019_train$win))
##
## 0 1
## 0.4995287 0.5004713
To begin with my Control models, I made a recipe that contained all of my predictor variables. I then normalized all of my predictor variables.
control_recipe <- recipe(win ~ k_d + assists + damage_dealt +
player_spm + ctrl_firstbloods +
ctrl_firstdeaths + ctrl_captures +
x2_piece + x3_piece + x4_piece,
data = control2019_train) %>%
step_normalize(all_predictors())
After making my recipe, I decided to fold my data with 10 folds and 5 repeats.
control_train_folds <- vfold_cv(control2019_train, v = 10, repeats = 5)
Creating a general decision tree specification using rpart:
control_tree_spec <- decision_tree() %>%
set_engine("rpart")
Setting a classification decision tree engine:
control_class_tree_spec <- control_tree_spec %>%
set_mode("classification")
Fitting the model:
control_class_tree_fit <- control_class_tree_spec %>%
fit(win ~ k_d + assists + damage_dealt +
player_spm + ctrl_firstbloods +
ctrl_firstdeaths + ctrl_captures +
x2_piece + x3_piece + x4_piece,
data = control2019_train)
Visualizing the decision tree:
control_class_tree_fit %>%
extract_fit_engine() %>%
rpart.plot()
Checking confusion matrix and accuracy of the train data:
augment(control_class_tree_fit, new_data = control2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 742 250
## 1 318 812
control_dt_accuracy <- augment(control_class_tree_fit, new_data = control2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
control_dt_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.732
Creating a workflow that is ready to tune cost complexity:
control_class_tree_wf <- workflow() %>%
add_model(control_class_tree_spec %>% set_args(cost_complexity = tune())) %>%
add_recipe(control_recipe)
Setting up a regular grid:
parameter_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)
Fitting and tuning our model:
control_dt_tune <- control_class_tree_wf %>%
tune_grid(resamples = control_train_folds, grid = parameter_grid)
Plotting our model, which shows what cost-complexity produces the highest accuracy:
autoplot(control_dt_tune)
As we can see, the accuracy and ROC AUC tend to decrease as the cost-complexity parameter increases.
Selecting the best performing value and finalizing the workflow:
control_best_complexity <- select_best(control_dt_tune, metric = "accuracy")
control_class_tree_final <- finalize_workflow(control_class_tree_wf, control_best_complexity)
control_class_tree_final_fit <- fit(control_class_tree_final, data = control2019_train)
Visualizing the final model:
control_class_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot(roundint = FALSE)
Checking the accuracy of the final model:
control_tuned_dt_accuracy <- augment(control_class_tree_final_fit, new_data = control2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
control_tuned_dt_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.827
As we can see, the final accuracy for the tuned model was noticeably higher, with an estimate of 0.827 compared to the untuned model’s 0.732.
Now it was time to prepare my random forest model. I tuned min_n and mtry, set my mode to “classification”, and set my engine to “ranger.” My workflow was set up to use both my Control recipe and my Control random forest model.
control_rf_model <- rand_forest(min_n = tune(),
mtry = tune(),
mode = "classification") %>%
set_engine("ranger")
control_rf_workflow <- workflow() %>%
add_model(control_rf_model) %>%
add_recipe(control_recipe)
Next, I set up parameters for the grid that I was going to make. The parameters were extracted from the Control random forest model, and the mtry range was set from 1 to 9, slightly below the maximum number of predictors.
control_rf_parameters <- hardhat::extract_parameter_set_dials(control_rf_model) %>%
update(mtry = mtry(range = c(1, 9)))
control_rf_grid <- grid_regular(control_rf_parameters, levels = 2)
Then, I ran my model by tuning and fitting, using my folded data and my grid.
control_rf_tune <- control_rf_workflow %>%
tune_grid(resamples = control_train_folds,
grid = control_rf_grid)
The last thing to do was to plot my tuned model.
autoplot(control_rf_tune)
As we can see from the above plot, accuracy and ROC AUC increase slightly as we allow more predictors per split. I hypothesize that this differs from the Hardpoint plot because more of the predictors are significant in Control.
Checking the accuracy of the final model:
control_rf_tuned_accuracy <- show_best(control_rf_tune, metric = "accuracy")
control_rf_tuned_accuracy
## # A tibble: 4 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 9 40 accuracy binary 0.749 50 0.00453 Preprocessor1_Model4
## 2 9 2 accuracy binary 0.742 50 0.00443 Preprocessor1_Model2
## 3 1 2 accuracy binary 0.740 50 0.00420 Preprocessor1_Model1
## 4 1 40 accuracy binary 0.736 50 0.00438 Preprocessor1_Model3
control_rf_tuned_accuracy[1,5]
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 0.749
We had the highest accuracy of 0.749 with a minimum node size of 40 and an mtry of 9.
First, I needed to set up my model. I set my engine to “glm” for logistic regression and set the mode to “classification.”
control_log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
Setting up workflow with the model I created last step, as well as the recipe that I created earlier.
control_log_wkflow <- workflow() %>%
add_model(control_log_reg) %>%
add_recipe(control_recipe)
Fit the model to the folded data:
control_log_fit <- fit_resamples(control_log_wkflow, control_train_folds)
Collecting metrics based on the folded data:
collect_metrics(control_log_fit)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.767 50 0.00504 Preprocessor1_Model1
## 2 roc_auc binary 0.841 50 0.00467 Preprocessor1_Model1
Fitting the model to the whole dataset, not just the folds:
control_log_fit_train <- fit(control_log_wkflow, control2019_train)
Assessing model performance with the training data:
predict(control_log_fit_train, new_data = control2019_train, type = "prob")
## # A tibble: 2,122 × 2
## .pred_0 .pred_1
## <dbl> <dbl>
## 1 0.0608 0.939
## 2 0.855 0.145
## 3 0.717 0.283
## 4 0.875 0.125
## 5 0.830 0.170
## 6 0.810 0.190
## 7 0.620 0.380
## 8 0.511 0.489
## 9 0.534 0.466
## 10 0.339 0.661
## # … with 2,112 more rows
Confusion matrix:
augment(control_log_fit_train, new_data = control2019_train) %>%
conf_mat(truth = win, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 849 277
## 1 211 785
Another visualization of the confusion matrix:
augment(control_log_fit_train, new_data = control2019_train) %>%
conf_mat(truth = win, estimate = .pred_class) %>%
autoplot(type = "heatmap")
Checking accuracy with the training data:
control_log_reg_accuracy <- augment(control_log_fit_train, new_data = control2019_train) %>%
accuracy(truth = win, estimate = .pred_class)
control_log_reg_accuracy
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.770
The final model had an accuracy of 0.770, which was slightly higher than the folded-data accuracy of 0.767.
Just like all of the previous models, I will follow a similar pathway. The steps are as follows:
Setting up the model. I will be tuning “neighbors.”
control_knn_model <-
nearest_neighbor(
neighbors = tune(),
mode = "classification") %>%
set_engine("kknn")
Next, I set up the workflow.
control_knn_workflow <- workflow() %>%
add_model(control_knn_model) %>%
add_recipe(control_recipe)
I then set up the tuning grid.
control_knn_parameters <- hardhat::extract_parameter_set_dials(control_knn_model)
control_knn_grid <- grid_regular(control_knn_parameters, levels = 2)
Fitting and tuning my model:
control_knn_tune <- control_knn_workflow %>%
tune_grid(resamples = control_train_folds,
grid = control_knn_grid)
Plotting the model:
autoplot(control_knn_tune, metric = "accuracy")
This plot shows that accuracy was higher with 15 neighbors than with 1, though with only two grid values tried, the trend is coarse.
Testing the accuracy of the model:
control_knn_tuned_accuracy <- show_best(control_knn_tune, metric = "accuracy")
control_knn_tuned_accuracy
## # A tibble: 2 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 accuracy binary 0.735 50 0.00426 Preprocessor1_Model2
## 2 1 accuracy binary 0.679 50 0.00472 Preprocessor1_Model1
As we can see, the model performs at its best with 15 neighbors, resulting in an accuracy of 0.735.
control_accuracies <- c(control_tuned_dt_accuracy$.estimate,
control_rf_tuned_accuracy[1,5],
control_log_reg_accuracy$.estimate,
control_knn_tuned_accuracy[1, 4])
control_accuracies
## [[1]]
## [1] 0.82705
##
## $mean
## [1] 0.7494743
##
## [[3]]
## [1] 0.7700283
##
## $mean
## [1] 0.7350554
The first number is the decision tree, followed by the random forest, logistic regression, and k-nearest neighbors models.
As we can see from this, the decision tree has the highest accuracy of 0.827. Logistic regression follows with 0.770, random forest with 0.749, and k-nearest neighbors with 0.735.
I assume that Control’s models are much more accurate than the other two modes because kills are crucial to how Control plays. Thus, K/D is probably even more important to the models.
According to my models, the decision tree appeared to be the best model for all three game modes. To move forward, we will now finalize the decision tree for each game mode and evaluate it on the test data.
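As an aside, tidymodels offers last_fit() as a one-step alternative for this train-then-test evaluation. A sketch using the Search and Destroy objects defined earlier:

```r
# last_fit() refits the finalized workflow on the full training set and
# evaluates on the held-out test set of the original split, in one call.
snd_last <- last_fit(snd_class_tree_final, snd2019_split)
collect_metrics(snd_last)
```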
I will begin fitting the final models with Hardpoint.
hp_metric <- metric_set(accuracy)
hp_model_test <- predict(hp_class_tree_final_fit, new_data = hp2019_test) %>%
bind_cols(hp2019_test %>% select(win))
hp_model_test
## # A tibble: 889 × 2
## .pred_class win
## <fct> <fct>
## 1 0 1
## 2 0 0
## 3 1 0
## 4 0 1
## 5 1 1
## 6 1 0
## 7 1 1
## 8 1 1
## 9 1 1
## 10 0 0
## # … with 879 more rows
hp_model_test %>%
hp_metric(truth = win, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.693
The final accuracy for the Hardpoint test data was 0.693, somewhat lower than the 0.756 we got on the training data.
Next, I will fit the Search and Destroy test data.
snd_metric <- metric_set(accuracy)
snd_model_test <- predict(snd_class_tree_final_fit, new_data = snd2019_test) %>%
bind_cols(snd2019_test %>% select(win))
snd_model_test
## # A tibble: 699 × 3
## .pred_class player win
## <fct> <chr> <fct>
## 1 0 Abezy 0
## 2 0 Abezy 0
## 3 0 Abezy 0
## 4 0 Abezy 1
## 5 0 Abezy 0
## 6 1 Abezy 0
## 7 0 Abezy 0
## 8 1 Abezy 1
## 9 1 Abezy 1
## 10 0 Abezy 1
## # … with 689 more rows
snd_model_test %>%
snd_metric(truth = win, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.720
The final accuracy for the Search and Destroy test data was 0.720, slightly lower than the 0.742 we got on the training data.
Next, I will fit the Control test data.
control_metric <- metric_set(accuracy)
control_model_test <- predict(control_class_tree_final_fit, new_data = control2019_test) %>%
bind_cols(control2019_test %>% select(win))
control_model_test
## # A tibble: 532 × 2
## .pred_class win
## <fct> <fct>
## 1 1 1
## 2 0 0
## 3 1 1
## 4 1 1
## 5 0 0
## 6 0 0
## 7 0 0
## 8 1 0
## 9 0 1
## 10 0 0
## # … with 522 more rows
control_model_test %>%
control_metric(truth = win, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.729
The final accuracy for the Control test data was 0.729, much lower than the 0.827 we got on the training data. This gap suggests that the tuned Control tree overfit the training set.
The COD Champs tournament is the largest tournament of the year. During Black Ops 4, the two teams that competed in the grand finals were eUnited and 100 Thieves. I will be using my models to try to predict the winner of this grand finals.
First, I will predict the first Hardpoint map. I created a dataframe which only contained the data from the first map.
The following printed dataframes will show whether a player is predicted to win (1) or lose (0) and which team they are on.
map1_hp <- sqldf('SELECT * FROM hpChamps where end_time = "2019-08-18 21:19:01 UTC"')
map1_hp_prediction <- predict(hp_class_tree_final_fit, map1_hp) %>%
cbind(map1_hp$player, map1_hp$team)
map1_hp_prediction
## .pred_class map1_hp$player map1_hp$team
## 1 0 Enable 100 Thieves
## 2 0 Kenny 100 Thieves
## 3 1 Octane 100 Thieves
## 4 0 Priestahh 100 Thieves
## 5 1 Slasher 100 Thieves
## 6 1 Abezy eUnited
## 7 1 Arcitys eUnited
## 8 1 Clayster eUnited
## 9 1 Prestinni eUnited
## 10 1 Simp eUnited
My model predicted that Enable, Kenny, and Priestahh would lose, whereas Octane and Slasher would win. On the other hand, every player on eUnited was predicted to win by my model.
Even though Octane’s and Slasher’s individual statistics looked good enough to win, the predictions for the rest of their team and the unanimous predictions for eUnited lead me to predict that eUnited won the first Hardpoint map.
In reality, eUnited did win the first map 250-126! So far, so good.
My prediction: 1-0 eUnited
Reality: 1-0 eUnited
Next, I will look at Map 2, Search and Destroy.
map2_snd <- sqldf('SELECT * FROM sndChamps where end_time = "2019-08-18 21:37:20 UTC"')
map2_snd_prediction <- predict(snd_class_tree_final_fit, map2_snd) %>%
cbind(map2_snd$player, map2_snd$team)
map2_snd_prediction
## .pred_class map2_snd$player map2_snd$team
## 1 1 Enable 100 Thieves
## 2 0 Kenny 100 Thieves
## 3 0 Octane 100 Thieves
## 4 1 Priestahh 100 Thieves
## 5 0 Slasher 100 Thieves
## 6 1 Abezy eUnited
## 7 0 Arcitys eUnited
## 8 0 Clayster eUnited
## 9 0 Prestinni eUnited
## 10 1 Simp eUnited
My model predicted that Priestahh and Enable would win from 100 Thieves, and that Abezy and Simp would win from eUnited. This is an apparent flaw in my approach to determining an overall winner, as this in theory should be a tie. For the sake of extending the series, let’s just say that 100 Thieves wins. Thus, I would predict that 100 Thieves would win the second map.
In reality, 100 Thieves won the second map 6-3.
My prediction: 1-1 Tied
Reality: 1-1 Tied
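The informal rule I have been applying (and the tie it just produced) can be written down as a hypothetical helper: count the predicted individual wins per team, call the team with more of them the map winner, and return NA on a tie.

```r
# Hypothetical helper formalizing the map-winner rule used in this section.
predict_map_winner <- function(pred_class, team) {
  wins <- tapply(pred_class == 1, team, sum)   # predicted wins per team
  if (length(wins) == 2 && wins[1] == wins[2]) return(NA_character_)  # tie
  names(which.max(wins))
}

# Map 1 above: 2 predicted wins for 100 Thieves vs 5 for eUnited.
predict_map_winner(c(0, 0, 1, 0, 1, 1, 1, 1, 1, 1),
                   rep(c("100 Thieves", "eUnited"), each = 5))
# [1] "eUnited"
```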
Next, I will look at Map 3, Control.
map3_control <- sqldf('SELECT * FROM controlChamps where end_time = "2019-08-18 21:55:55 UTC"')
map3_control_prediction <- predict(control_class_tree_final_fit, map3_control) %>%
cbind(map3_control$player, map3_control$team)
map3_control_prediction
## .pred_class map3_control$player map3_control$team
## 1 0 Enable 100 Thieves
## 2 1 Kenny 100 Thieves
## 3 1 Octane 100 Thieves
## 4 1 Priestahh 100 Thieves
## 5 0 Slasher 100 Thieves
## 6 1 Abezy eUnited
## 7 0 Arcitys eUnited
## 8 0 Clayster eUnited
## 9 0 Prestinni eUnited
## 10 0 Simp eUnited
My model predicted that Kenny, Octane, and Priestahh would win from 100 Thieves, and only Abezy would win from eUnited. Thus, I would predict that 100 Thieves would win this map.
In reality, 100 Thieves did win the third map, 3-1! Another success, phew.
My prediction: 2-1 100 Thieves
Reality: 2-1 100 Thieves
Next, I will look at Map 4, Hardpoint.
map4_hp <- sqldf('SELECT * FROM hpChamps where end_time = "2019-08-18 22:11:21 UTC"')
map4_hp_prediction <- predict(hp_class_tree_final_fit, map4_hp) %>%
cbind(map4_hp$player, map4_hp$team)
map4_hp_prediction
## .pred_class map4_hp$player map4_hp$team
## 1 0 Enable 100 Thieves
## 2 1 Kenny 100 Thieves
## 3 1 Octane 100 Thieves
## 4 1 Priestahh 100 Thieves
## 5 0 Slasher 100 Thieves
## 6 1 Abezy eUnited
## 7 1 Arcitys eUnited
## 8 0 Clayster eUnited
## 9 0 Prestinni eUnited
## 10 1 Simp eUnited
My model predicted that Kenny, Octane, and Priestahh would win from 100 Thieves, and that Abezy, Arcitys, and Simp would win from eUnited. Again, the per-player predictions are tied at three apiece, exposing the same flaw in my methodology. For the sake of extending the series once again, I’ll predict that eUnited wins Map 4.
In reality, eUnited won Map 4, 250-219.
My prediction: 2-2 Tie
Reality: 2-2 Tie
This is what it all comes down to! My model has gotten us to this final map (albeit through a different path). Lastly, I will predict the winner of the final Search and Destroy to determine the winner of the whole event.
map5_snd <- sqldf('SELECT * FROM sndChamps where end_time = "2019-08-18 22:32:13 UTC"')
map5_snd_prediction <- predict(snd_class_tree_final_fit, map5_snd) %>%
cbind(map5_snd$player, map5_snd$team)
map5_snd_prediction
## .pred_class map5_snd$player map5_snd$team
## 1 0 Enable 100 Thieves
## 2 0 Kenny 100 Thieves
## 3 0 Octane 100 Thieves
## 4 0 Priestahh 100 Thieves
## 5 1 Slasher 100 Thieves
## 6 1 Abezy eUnited
## 7 1 Arcitys eUnited
## 8 1 Clayster eUnited
## 9 1 Prestinni eUnited
## 10 0 Simp eUnited
My model predicted that Slasher on 100 Thieves would win, and that Abezy, Arcitys, Clayster, and Prestinni would win from eUnited. Thus, eUnited is my overall predicted winner!
In reality, eUnited did win the final map 6-4! My model was correct in the end!
If you would like to see the actual series between 100 Thieves and eUnited, I have linked it below.
embed_url("https://www.youtube.com/watch?v=AfWqhymZnoc")
I expect that most of my models had lower accuracy on my test data compared to the training data because of overfitting. I had a feeling this would happen, since the final decision trees were much more complex than they were before tuning. Regardless, I am happy with an accuracy of 0.70 for each model. For a game that is largely a team effort, predicting a win with 0.70 accuracy by looking only at a single player’s statistics is not too bad.
Apart from all the research and code that is currently shown in this project, there were also hundreds of lines of code that didn’t make the cut. I tried multiple different methods of modeling, multiple different ways of organizing data, and multiple different metrics that weren’t included in the original data set. I tried to approach it from both an individual player’s perspective, as well as a team’s perspective. Ultimately, as the project discusses, I went with an individual player’s statistics. From this research, I have learned a lot about the predictive variables that determine a win or a loss for a player.
One thing that I immediately noticed was the massive importance of Kill/Death (KD) ratio. In practically every model, KD was the most important metric in determining a win or a loss. It overpowered most other metrics, which was a shock for me. For example, as discussed before, First Bloods in Search and Destroy are thought to be very important to a win. However, statistically, it didn’t even appear in some of the models.
Another thing that I noticed was how unimportant other metrics were. Some metrics that I thought would be important weren’t. However, I would be interested in seeing how some of these metrics might be weighted differently when approached from a team perspective.
In the future, I believe a more effective approach to predicting a win or a loss would be to look at a team’s statistics as a whole. You could take this even further by ranking each team using their past statistics and using those rankings to predict future matches: an unsupervised learning experiment. Additionally, working with team data would clear up the issues I ran into when predicting the Champs Grand Finals. My approach was flawed from the start, and I didn’t even realize it until it came time to predict everything.
Additionally, the data that I used for this research was from the CWL in 2018. Since then, the CWL has transformed into the CDL (Call of Duty League), with more focus being placed on data. There are more predictors to look at that just weren’t available in the CWL data. Future research should use CDL data, as there are some interesting parameters, such as untraded kills, that might provide more predictive power. I also think it would be interesting to look at interactions between teams and maps, and try to use those to aid in predicting a winner.
This project was a great opportunity to get hands-on with the R language and machine learning concepts. Working with Call of Duty data gave me a reason to keep going, along with an intuitive prior understanding of the data. There is still a lot to learn when it comes to machine learning, and as I learn more, I hope to apply it to my future research ideas. All in all, this project gave me a better understanding of a player’s statistics in Call of Duty, and it will give me an extra thing to think about whenever I watch future competitive matches.